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ABSTRACT 

We present a semi-supervised method for photometric supernova typing. Our approach 
is to first use the nonlinear dimension reduction technique diffusion map to detect 
structure in a database of supernova light curves and subsequently employ random 
forest classification on a spectroscopically confirmed training set to learn a model 
that can predict the type of each newly observed supernova. We demonstrate that 
this is an effective method for supernova typing. As supernova numbers increase, our 
semi-supervised method efficiently utilizes this information to improve classification, 
a prope rty not enjoyed by tem plate based methods. Applied to supernova data simu- 
lated bv lKessler et al.l (|20f0bt ) to mimic those of the Dark Energy Survey, our methods 
achieve (cross-validated) 95% Type la purity and 87% Type la efficiency on the spec- 
troscopic sample, but only 50% Type la purity and 50% efficiency on the photometric 
sample due to their spectroscopic follow-up strategy. To improve the performance on 
the photometric sample, we search for better spectroscopic follow-up procedures by 
studying the sensitivity of our machine learned supernova classification on the specific 
strategy used to obtain training sets. With a fixed amount of spectroscopic follow-up 
time, we find that, despite collecting data on a smaller number of supernovae, deeper 
magnitude-limited spectroscopic surveys are better for producing training sets. For su- 
pernova la (II-P) typing, we obtain a 44% (1%) increase in purity to 72% (87%) and 
30% (162%) increase in efficiency to 65% (84%) of the sample using a 25th (24.5th) 
magnitude-limited survey instead of the shallower spectroscopic sample used in the 
original simulations. When redshift information is available, we incorporate it into 
our analysis using a novel method of altering the diffusion map representation of the 
supernovae. Incorporating host redshifts leads to a 5% improvement in Type la purity 
and 13% improvement in Type la efficiency. 

Key words: methods: data analysis - methods: statistical - techniques: photometric 
- supernovae: general - surveys 



1 INTRODUCTION 

Novel approaches to photometric supernova (SN) classifi- 
cation are in high demand in the astronomical community. 
The next generation of survey telescop es, such as the Dark 
Energy Survey fDES; lA~nms et aUl201lf) and the L arge Syn- 
optic Survey Telescope (LSST; Ivezic et alj 120081 ') . are ex- 
pected to observe light curves for a few hundred thousand 
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supernovae (SNe), far surpassing the resources available to 
spectroscopically confirm the type of each. To fully exploit 
these large samples, it is necessary to develop methods that 
can accurately and automatically classify large samples of 
SNe based only on their photometric light curves. 

In order to use Type la supernovae as cosmological 
probes, it is imperative that pure and efficient Type la sam- 
ples are constructed. Yet, classifying SNe from their light 
curves is a challenging problem. The light flux measurements 
are often noisy, nonuniform in time, and incomplete. In par- 
ticular, it is difficult to discern the light curves of Type la 
SNe from those of Type lb or Ic supernovae, explosive events 
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which result from the core collapse of massive stars. This dif- 
ficulty can have dire effects on the subsequent cosmological 
inferences: if the sample of SNe la used to make cosmolog- 
ical inferences is impure, then the cos mological param eter 
estimates can be significantly biased l|Homeierl l2005t ). On 
the other hand, if the classification of Type la SNe is not 
efficient, then cosmologists do not fully utilize the sample of 
supernovae on hand, resulting in a loss of precision. 

In the last decade, several supernova pho- 
tometric classification methods have been intro- 
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approaches uses some version of template fitting, where 
each observed supernova's data is compared to the data 
from well established SNe of different types and the 
subsequent classification is estimated as the SN type 
whose template fits best (usually judged by maximum 
likelihood or maximum posterior probability). Usually, 
the sets of templates are constructed using the observed 
spec tra of sets of well studied , high signal-to-noise SNe 
(see iNugent. Kim fc Perlmutterl |2002| ) or spectral energy 
distribution models of SNe. 

The principal drawback of using template methods for 
classification is that they depend heavily on the templates 
used for each of the different classes. If there are errors in 
the template creation, these will propagate down to the es- 
timated classifications. Furthermore, template fitting often 
assumes that each observed SN can be well modeled by one 
of the templates, an assumption that may be unreasonable, 
especially for large data sets. Additionally, these methods 
require that all relevant parameters (such as redshift and 
extinction) in the light curve model be simultaneously fit 
or otherwise estimated, which can pose computational chal- 
lenges and cause catastrophic errors when the estimates are 
poor. 

A viable alternative to template fitting for SN classifi- 
cation is semi-supervised learning, a class of methods that 
are able to learn the low dimensional structure in all avail- 
able data and exploit that structure in classification; as 
more data are obtained, these methods are able to utilize 
that additional information to better classify all of the SNe. 
Template fitting methods, on the other hand, do not au- 
tomatically learn as new data are collected (for example, 
ISako et al.l [20T1I extract only 8 core-collapse SN templates 
from over 10,000 observed supernova candidates). Adverse 
effects caused by incorrectly classified templates, under- 
sampled regions of template parameter space, or unrealistic 
templates will not go away as the number of SNe increases. 
Indeed, the magnitude of these biases relative to the sta- 
tistical errors in the estimates will only increase. Well con- 
structed semi-supervised approaches will reduce both bias 
and variance as the sample sizes grow. 

Our main contribution in this paper is to introduce 
a method for SN photometric classification that does not 
rely on template fitting. Our proposed method uses semi- 
supervised learning on a database of SNe: we first use all of 
the light curves to simultaneously estimate an appropriate, 
low dimensional representation of each SN, and then we em- 



ploy a set of labeled (spectroscopically confirmed) examples 
to build a classification model in this reduced space, which 
we subsequently use to estimate the type of each unknown 
SN. 

An advantage to our semi-supervised approach is that 
it learns from the set of unlabeled SNe. Typically there 
are many more unlabeled than labeled supernovae, meaning 
that a method that learns from all the data is an obvious 
improvement over methods that only learn from the labeled 
SNe. Another advantage is that our method gives an accu- 
rate prediction of the class of each supernova without having 
to simultaneously estimate nuisance parameters such as red- 
shift, stretch or reddening. This result arises because vari- 
ations in these parameters appear as gradual variations in 
the low dimensional representation of the light curves when 
the observed data (labeled plus unlabeled SNe) are collected 
densely enough over the nuisance parameters. In the low di- 
mensional representation, the variability due to supernova 
type is greater than the variability due to the nuisance pa- 
rameters, allowing us to build an accurate classifier in this 
space. 

Until recently, there had been no standard testing pro- 
cedure for the various SN classificati on methods. To assess 
the performance of these methods, iKessler et al.l l|2010b[ ) 
held a public "SN Photometric Classification Challenge" in 
which they released a blended mix of simulated supernovae 
(la, lb, Ic, II) to be classified. As part of the Challenge, a 
spectroscopically confirmed subset of SNe was included on 
which the Challenge participants could train or tune their 
algorit hms. The results of that Challenge l|Kessler et al.l 
l2010al ) showed that various machine learning classification 
algorithms were competitive with classical template fitting 
classifiers. Apart from our entry (InCA0), none of the other 
entries to the Classification Challenge used semi-supervised 
learning. In this paper we will further explore the viability 
of semi-super vised classifiers for SN classification. 

Recently, iNewling et al.l (|201ll 'l released a paper detail- 
ing their entries into the SN Challenge. They argue that 
their methods are comparable to the best template fitting 
techniques, but that they require training samples that are 
representative of the full (photometric) sample. Our findings 
are similar, and we carry out a detailed sensitivity analysis 
to determine how the accuracy of predicted S N types depend 
on ch aracteristics of the training set. Like INewling et alJ 
we perform our analysis both with and without pho- 
tometric redshift estimates; we introduce a novel and effec- 
tive way of using photo-z estimates in finding a low dimen- 
sional embedding of SN data. 

Based on our sensitivity analysis, we conclude that 
magnitude-limited spectroscopic follow-up strategies with 
deep limits (25th mag) produce the best training sets for our 
supernova classification method, at no extra observing cost. 
Though these deeper observing strategies result in fewer su- 
pernovae than shallower samples, they produce training sets 
that more closely resemble the entire population of SNe un- 
der study, causing overwhelming improvements in supernova 
classification. We strongly recommend that spectroscopic 
SN follow-up is performed with faint magnitude limits. 
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The layout of the paper is the following. In [J2] we de- 
scribe our semi-supervised classification method, which is 
based on the diffusion map method for nonlinear dimen- 
sionality reduction and the random forest classification tech- 
nique. We couch our description in general terms since el- 
ements of our methodology can in principle be applied to 
any set of light curves, not just those from supernovae. In 
§33(41 we describe the application of our methodology to 
the problem of classifying supernovae, fully detailing the 
steps in our analysis. These sections are divided into the 
unsupervised (^3]) and supervised (Q parts of the analysis. 
We assess the performance of our methods in ij5] using the 
21,319 light curves s imulated for the SN Photometric Clas- 
sification Challenge l|Kessler et alJliblObl ). We examine the 
sensitivity of our results to the composition of the training 
datasets, and to the treatment of host-galaxy redshift (by 
either using redshifts to alter the diffusion map construction 
or using them as covariates in random forest classification). 
In we offer a summary and our conclusions. We provide 
further information on classification trees and the random 
forest algorithm in Appendices lAl and [Bl 



2 METHODOLOGY: SEMI-SUPERVISED 
CLASSIFICATION 

Suppose we have observed photometric light curves for TV 
objects. For each, the observed data are flux measurements 
at irregularly-spaced time points in each of a number of dif- 
ferent filters (e.g., griz), plus associated measurement er- 



{tik, Fik, ffifc}, where 



fc = 1, p<, 6 indexes the filter, and p\ is the number of flux 



rors. Call the data for object i, Xi 

,pI 

measurements in filter b for object i. Here, is the time grid, 
and Fj and u\ are the flux measurements and errors, respec- 
tively, in filter b. Suppose, without loss of generality, that the 
first n objects, Xi, X„, have known types y±, y n . Our 
goal is to predict the type of each of the remaining TV — n 
objects. 

To perfor m this classification, we take a semi - supervised 
approach (see IChapelle. Scholkopf fc Zienl [2006 for an in- 
troduction to semi-supervised learning methods). The basic 
idea is to use the data from all TV objects (both labeled and 
unlabeled) to find a low dimensional representation of the 
data (the unsupervised part) and then to use just the n la- 
beled objects to train a classifier (the supervised part). Our 
proposed procedure is as follows: 

(i) Do relevant preprocessing to the data. Because this is 
application specific, we defer details on our preprocessing of 
SN light curves to the next section. Here, it suffices to state 
that the result of this preprocessing for one object is a set of 
light curve function estimates interpolated over a fine time 
grid. 

(ii) Use the preprocessed light curves for all TV objects 
to learn a low dimensional, data driven embedding of 
{Xi, Xjv}. We use the diffusion map method for non- 
linear dimensionality reduction ( 32.11 1. 

(iii) Train a classification model on the n labeled (spec- 
troscopically confirmed) examples that predicts class as a 
function of the diffusion map coordinates of each object. We 
use the random forest method ( £|2.2[) as our classifier. 

(iv) Use the estimated classification model to predict the 
type of each of the TV — n unlabeled objects. 



The advantage of the semi-supervised approach is that it 
uses all of the observed data to estimate the lower dimen- 
sional structure of the object set. Generally, we have many 
more objects without classification labels than with (e.g, the 
SN Photometric Classification Challenge provided 14 unla- 
beled SNe per labeled SN). If we have continuously varying 
parameters that affect the shapes and colours of the light 
curves (e.g., redshift, extinction, etc.) then it is imperative 
that we use as many data points as possible to capture these 
variations when learning a low dimensional representation. 
We then use the examples for which we know the object 
type to estimate the classification model, which is finally 
employed to predict the type of each unlabeled object. 



2.1 Diffusion Map 

In this section, we review the basics of the diffusion map 
approach to spectral connectivity analysis (SCA) for data 
parametrization. We detail our approach of using diffu- 
sion map to uncover structure in databases of astronomi- 
cal objects. For more specifi cs on the diffusion map tech- 
nique, we refer the reader to ICoifman fc Lafonl (|2006T ) and 
lLafon fc Lee] |2006T ). For examples of the ap plication of dif- 
fusion map to problems in astr ophys ics. seelRichards et all 
j2009br).lRichards et all (1200981 ). and lFreeman et al.l (|2009F 7 



In Richards et al. I (|2009bh 7 the authors compare and con- 
trast the use of diffusion maps, which are non-linear, with 
a more commonly utilized linear technique, principal com- 
ponents analysis (PC A), and demonstrate the superiority of 
diffusion maps in predicting spectroscopic redshifts of SDSS 
data. 

The basic idea of diffusion map is the following. To make 
statistical prediction (e.g., of SN type) tractable, one seeks 
a simpler parameterization of the observed data, which is 
often complicated and high-dimensional. The most common 
method for data parameterization is PCA, where the data 
are projected onto a lower-dimensional hyperplane. For com- 
plex situations, however, the assumption of linearity may 
lead to suboptimal predictions because a linear model pays 
very little attention to the natural geometry and variations 
of the system. 

The top plot in Figure 1 of iRichards et~aH lj2009bh 
illustrates this by showing a data set that forms a one- 
dimensional noisy spiral in two dimensional Euclidean space. 
Ideally, we would like to find a coordinate system that re- 
flects variations along the spiral direction, which is indicated 
by the dashed line. It is obvious that any projection of the 
data onto a line would be unsatisfactory. If, instead, one 
imagines a random walk starting at x that only steps to 
immediately adjacent points, it is clear that the number of 
steps it takes for that walk to reach y reflects the length be- 
tween x and y along the spiral direction. This is the driving 
idea behind the diffusion map, in which the "connectivity" 
of the data in the context of a Active diffusion process, is 
retained in a low-dimensional parametrization. This simple, 
non-linear parametrization of the data is useful for uncov- 
ering simple relations with the quantity of interest (e.g., su- 
pernova type) and is robust to random noise in the data. 

We make this more concrete below. 

Diffusion map begins by creating a weighted, undirected 
graph on our observed photometric data {Xi, Xjv}, 
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where each data point is a node in the graph and the pair- 
wise weights between nodes are defined as 



w(Xi, Xj) = exp 



s(Xi, Xj 



(1) 



where e > is a tuning parameter and s(-, •) is a user-defined 
pairwise distance between objects. Here, s is a local distance 
measure, meaning that it should be small only if and Xj 
are similar (in £|3.3l we define the distance measure we use for 
SN light curve data). In this construction, the probability of 
stepping from Xi to Xj in one step of a diffusion process is 



Pi(Xi,X^ 



L)(Xi, Xj)/ J^ fe w(Xi, Xi,). We store the one 



step probabilities between all TV data points in the TV x TV 
matrix P; then, by the theory of Markov chains, for any pos- 
itive integer t, the element p t (Xi,Xj) of the matrix power 
gives the probabilit y of going from Xj to X, in t steps. 
See, e.g., Chapter 6 in Grimmett fc Stirzakerl (|200ll ) for an 
introduction to Markov chains. 

We define the diffusion map at scale t as 



* f :X^ [Ai*i(X),A t 2 * 2 (X),...,AL* m (X)] 



(2) 



where ^/j and Aj are the right eigenvectors and eigenvalues 
of P, respectively, in a biorthogonal spectral decomposition 
and m is the number of diffusion map coordinates chosen 
to represent the data. The diffusion map coordinates are or- 
dered such that Ai ^ A2 ^ ... ^ A m , so that the top m 
coordinates retain the most information about P*. Though 
there are TV total eigenvectors of P, only m <C TV are re- 
quired to capture most of the variability of the system. 

The Euclidean distance between any two points in the 
m-dimensional space described by equation ([2]) approxi- 
mates the diffusion distance, a distance measure that cap- 
tures the intrinsic geometry of the data set by simultane- 
ously considering all possible paths between any two data 
points in the i-step Markov random walk constructed above. 
Because it averages over all possible paths between data 
points in the random walk, the diffusion distance is robust 
to noise in the observed data. The choice of the parameters 
e (in equation [TJ and m gives the map a great deal of flexi- 
bility, and it is feasible to vary these parameters in an effort 
to obtain the best classifier via cross-validation, as described 
in JOJ 

We note that in the random forest classifier used to pre- 
dict object type (see i]2.2[) . the scale of each coordinate of 
^'(X) does not influence the method because each classi- 
fication tree is constructed by splitting one coordinate at a 
time. Therefore, the parameter t, whose role in the diffusion 
map (J2J) is to rescale each coordinate, has no influence on 
our analyses. We choose to fix t to have the value 1. For 
the remainder of this paper, we will use to stand for the 
m-dimensional vector of diffusion map coordinates, VE^X;). 



2.2 Classification: Random Forest 

In finding the diffusion map representation of each object, 
the idea is that this parameterization will hopefully obey a 
simple relationship with respect to object type. Then, sim- 
ple modeling can be performed to build a class-predictive 
model from the diffusion map coordinates. That is, once we 
have the m-dimensional diffusion map representation of each 
object's light curve (equation [2}, we construct a model to 



predict the type, j/;, of the i* object as a function of its dif- 
fusion map representation, VE^. In other words, using the set 
of n SNe with known classification labels, we estimate the 
underlying function h that relates each m-dimensional diffu- 
sion map representation with a classification label. We will 
ultimately use this estimate, h, to predict the classification, 
tjj — h(^fj) for each unlabeled supernova j = n + 1, TV. 

Any classification procedure that can handle more than 
two classes could be applied to the diffusion map representa- 
tion of the SNe. By adapting the predictors to the underly- 
ing structure in the data, standard classification procedures 
should be able to separate the SNe of different types. There 
are many classification tools available in the statistics and 
machine learning literature, ranging to simple K nearest- 
neighbor averaging to more soph isticated ensemble meth- 
ods (see iBloom fe Richards! l201ll for an overview of some 
of the classification methods that have been used in time- 
domain as tronomy) . We c hoose to use the random forest 
method of iBreimanl <|200lh due to its observed success in 
many m ulticlass classification settin g s, including in astro- 
physics dBreiman, Last fc Rice! 120031 ; iRichards et all l201ll ; 
iDubath et al.ll201ll ). The basic idea of the random forest is 
to build a large collection of decorrelated classification tree 
estimates and then to average these estimates to obtain the 
final predictor, h. This approach usually works well because 
classification tree estimates are notoriously noisy, but tend 
to be unbiased. By averaging decorrelated tree estimates, 
the random forest produces classification estimates that are 
both unbiased and have small variance with respect to the 
choice of training set. See the Appendices A and B for a 
brief overview of classification trees ( jyy an d random forest 



3 SEMI-SUPERVISED SUPERNOVA 

CLASSIFICATION: THE UNSUPERVISED 
PART 

The first step of our analysis is to use the entire dataset 
of supernova light curves to learn an appropriate represen- 
tation of each supernova using diffusion map. This is the 
unsupervised part of our semi-supervised approach. 



3.1 Data 

We apply the methods described in to the 

SNPhotCC+HOSTZ dat aset of the SN Ph otometric Clas- 
sification Challenge (jKessler et ajj l2010bT ). We use data 
from the "U p dated Simulations'] 2 ], as described in §6 of 
iKessler et ail (|2010al 1. These data were simulated to mimic 
the observing conditions of the DES. Note that these 
data are significantly different than the data used in the 
Challenge, with several bug fixes and improvements to make 
the simulations more realistic. For instance, the ratio of 
photometric to spectroscopic SNe was 13:1 in the Challenge 
data, but 18:1 in the Updated Simulations. Therefore, we 



2 Data can be downloaded at 
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refrain from comparing our current results directly to the 
results in the Challenge^ 

We denote the updated Challenge data as T>. There are 
a total of TV = 20,895 SNe in £0 and for each SN we have griz 
photometric light curves. These light curves are simulated 
by the Challenge organizers so as to mimi c the observing 
cond itions of the Dark Energy Survey fDES: lBernstein et al.l 
120091 ). The light curve for each SN is measured on anywhere 
between 16 to 160 epochs (between 4 and 40 epochs in each 
filter), with a median value of 101 epochs. The maximum r- 
band signal-to-noise ratio for the light curves ranges between 
0.75 and 56, with a median of 11. 

The data, T>, is originally comprised of two sets. One 
that we dub S contains 1,103 spectroscopically confirmed 
SN light curves, while the other, containing 19,792 photo- 
metrically observed SNe, is dubbed V '. 

3.2 Data preprocessing 

In order to utilize diffusion map in our classification scheme, 
we first need to formulate a distance metric, s, between ob- 
served SN light curves (see equation [TJ . Our distance mea- 
sure should capture differences in the shapes of the light 
curves and the colours of the SN. Phrased differently, be- 
cause the measure s is only considered on small scales (con- 
trolled by e in equation [1]), we wish to construct a measure 
that is small only if both the shapes of the light curves and 
the colours of two SNe are very similar. 

3.2.1 N on- Parametric Light Curve Shape Estimation 

Each SN is sampled on an irregular time grid that differs 
from filter to filter and from SN to SN. To circumvent this 
difficulty, we find a nonparametric estimate of the shape of 
the observed light curve, Xi = {t h i k , F b k ,o-\ k } , for each SN. 
With this estimate, we can shift from using the irregularly 
sampled observed data to using fluxes on a uniform time 
grid when computing the distances between light curves. 

We independently fit a natural cubic regression spline 
to the data from each filter, 6 , an d each SN, i (see, e.g., 
iRuppert? Wand fc Carrol] 120031 and IWassermanll2006l ). We 
utilize regression splines because they are particularly useful 
for estimating smooth, continuous functions that can have 
complicated behavior such as rapid increases. In doing so, we 
avoid assuming an overly restrictive template model for the 
shape of each light curve. We also leave the number of spline 
knots as a free parameter, allowing the model to adapt to 
the complexity of the true SN light curve shape. Our cubic 
spline estimator is 

1/+4 

F! k =F!(t k ) = J2^ j B j (t k ) (3) 
where Bj is the j th natural cubic spline basis, t k are points 

3 Specifically, we find that our methods perform 40% better on 
the Challenge data than on the data used in this paper in terms 
of the photometric la figure of merit. 

4 After removal of the 424 SNe simulated from the SDSS 2007nr 
II-P template, whose peak luminosities are anomalously dim by 
several magnitudes. 



along our uniform time grid, and v is the number of knots. 
The are estimators of the coefficients /8y and are fit 
from the observed Xi by minimizing weighted least squares 
against the observed fluxes F\ k with weights (l/a\ k ) 2 . By 
fiat, we choose a grid of 1 measurement per MJD, noting 
that we have found denser grids to have negligible effect on 
our final results while incurring increased computing time. 
Other implementations could use a sparser grid, resulting in 
faster computing times. 

When fitting regression splines, we must choose the 
quantity, u, and locations of the knots, which correspond to 
points of discontinuity of the third derivative of the spline 
function. We follow convention and place the knots uni- 
formly over the observed time points. To choose v, we min- 
imize the generalized cross-validation (GCV) score, defined 
as 

where F? k is the fitted value of the spline with v knots at t k 
computed using eq. (|3]). Minimizing eq. Q over v balances 
the bias and variance (bumpiness) of the fit. Note that the 
observational uncertainties in the measured fluxes, a\ k , are 
used to compute both the LC flux estimates, F^ k , and the 
model uncertainty in those estimates, a\ k . In £|3.3l we will use 
these entities to construct a distance metric used to compute 
the diffusion map representation of each SN. 

We have applied the spline fitting routine to a set of 
simulated light curves, and the method produces reasonable 
fits. For an example of the fit for a single supernova's griz 
light curves, see Figure [T] For a few SNe, minimizing the 
GCV led to too many knots (the estimated light curve was 
deemed too bumpy), so the parameter v is restricted to be 
no greater than 10 to ensure that the estimated curve for 
each band is physically plausible. 

Once we estimate a SN's raw light curve function, we 
normalize the flux to mitigate the effect that the observed 
brightness of each SN has on the pairwise distance estimates. 
The normalized flux of SN i is 

Ft = — . (5) 

Similarly, we normalize the model error, a\ k , associated with 
each spline-estimated flux F^ k ; call this o~\ k . Because the 
same divisor is used for each photometric band, the colour of 
each supernova is preserved. A distance measure constructed 
from {F b , of} will capture both the shape and colour of the 
light curve. 

3.2.2 Zero-Point Time Estimation 

In order to accurately measure the similarities and differ- 
ences in the shapes of the observed light curves, we must 
ensure that they are aligned on the time axis. To this end, we 
define the zero-point time of a SN light curve as its time, in 
MJD, of maximum observed r-band flux, t .i — m&x k {F[ k } . 
Shifting all the light curves to this common frame, but set- 
ting ti = U — t ,i enables us to construct a pair- wise distance 
measure that captures differences in observed LC shapes and 
colors. 
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Figure 1. Spline fit to the data of a single Type la supernova. 
A regression spline was fit independently to the light curve data 
in each band using GCV (equation [4} . The spline fit (solid red 
line) and model errors in that fit (blue dashed lines) are used to 
compute a pairwise supernova light curve distance {7J, which is 
used to find a diffusion map representation of each SNe. 



We estimate the zero-point time for a SN whose r-band 
maximum occurs at a time endpoint — either the first or last 
epoch of the observed light curve — on a case by case basis: 
if it is being compared to a light curve whose maximum r- 
band flux occurs either at the same endpoint or at neither 
endpoint, we use cross-correlation to estimate t ; if it is being 
matched with a SN whose r-band flux occurs at the opposite 
endpoint, we abort the zero-point estimation process and set 
the distance to a large reference quantity since the two light 
curves are, by definition, dissimilar. In our SN database, 
2908 (14%) of the supernovae are observed post r-band peak 
while 1489 (7%) are observed pre r-band peak. 

Our use of r-band data to establish a zero-point time 
is motivated by simulations of Type la SNe data. Let t de- 
note o ur estimator of t . U sing 1,000 SNela simulated with 
SNANA |Kessler et al.l [20091 ) . which are generated using the 
default settings for the DES, we can characterize this es- 
timator for each photometric band because the true time 
zero-point, to, of the simulations. Note that for the SNANA 
simulations, the true time of peak flux is computed in each 
band separately and then the results are combined using a 
filter-weighted average; thus, we are not actually comparing 
our estimator, t , to the true time of r-band maximum, but 
rather to a slightly different, redshift-dependent quantity. 
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Figure 2. The difference At = to — to (in days) versus redshift 
for 600 SNela simulated with SNANA using its default settings for 
the Dark Energy Survey. The 600 SNe are the peaked SNe from a 
simulated sample of 1,000 SNe. The top row shows g- and r-band 
data while the bottom row shows i- and z-band data. Overlaid on 
the r-band data is a linear regression fit to the data in the range 
—5 < At < 5 that is meant to be purely illustrative. 



Table 1. Time Normalization Estimator: Different Filters. 



Filter 


At (days) 


a (days) 


P 


g 


13.86 


27.26 


0.66 


r 


-0.27 


2.17 


-0.67 


i 


2.21 


2.65 


-0.27 


z 


3.25 


2.84 


0.30 



See Figure [2] and Table [T] for the results for the 600 
SNe in the sample with peaks (i.e., the regression spline 
maximum is not at an end point). The z-axis of Figure [2] 
shows At — t — t , while the second and third columns of 
Table[T]show the estimated mean and standard deviation for 
At. (The fourth column indicates the estimated correlation 
between At and redshift z.) We find that using the time of 
the r-band peak flux as our estimator t is the proper choice: 
it is nearly unbiased and it has the minimum variance. 

In Figure [3] and Table [2] we characterize t , the r-band 
max estimator of to, given 1,000 examples each of differ- 
ent SN types, each observed in the r band. The conclusion 
that we draw is that if we corrected t using host photo- z 
estimates, the effect on other SN types would be minimal, 
judging by the standard deviations shown in Table [2] 

Table 2. Time Normalization Estimator: Different SN Types 



SN Type 


P (days) 


a (days) 


P 


la 


-0.27 


2.17 


-0.67 


Ibc 


2.93 


11.36 


-0.24 


Iln 


1.97 


9.57 


-0.03 


II-P 


32.99 


31.76 


-0.102 


II-L 


-2.19 


12.64 


-0.297 
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Figure 3. The difference At = t — t (in days) versus redshift 
for peaked SNela simulated with SNANA using its default settings 
for the Dark Energy Survey. The top row shows r-band data for 
SN Types la, Ibc, and Iln while the bottom row shows r-band 
data for SN Types II-P and II-L. Overlaid on the SNela results is 
a linear regression fit to the data in the range — 5 < At < 5 that 
is meant to be purely illustrative. 



Simulated SNe light curves without a peak in the r-band 
are treated differently from those with peaks. To estimate 
t for a given un-peaked light curve, we cross-correlate it 
with with each SN that has an r-band peak. This produces a 
sequence of estimates of t a , [t j)jLi, where M is the number 
of peaked SN. Finally, we return 

1 A/ 

as the estimate of t a for that unpeaked light curve. To 
examine our use of cross-correlation, we return to the 
SNANA simulated SNe described above. In Figure [4] we 
plot At for using the endpoint and using cross-correlation 
for each of the 400 unpeaked SN. Without cross-correlation, 
= (—5.02,4.76); with cross-correlation, the values are 
(0.87,5.63), i.e., a small increase in estimator standard de- 
viation is offset by a marked decrease in bias. Thus we con- 
clude that cross-correlation is an effective approach. 
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Figure 4. The difference At = t — t (in days) versus redshift 
for 400 SNcfa simulated with SNANA using its default settings for 
the Dark Energy Survey. The 400 SNe are the unpeaked SNe 
from a simulated sample of f ,000 SNe. The left panel shows the 
distribution of At values assuming the time of the first r-band 
datum for t , while the right panel shows the same distribution 
except that t is estimated by cross-correlating the unpeaked light 
curve with all peaked light curves in the sample and taking the 
mean (see cq. |6j . 



define the squared 6-band distance between them, where b 
ranges over griz, as 



ti 



E 



r ik 



F b 



(^) 2 + i^k 



(7) 



where k indexes the time grid of the smoothed super- 
nova light curves; a time binning of 1 day is used. Hence, 
Sb(Xi, Xj) is the weighted Euclidean distance between light 
curves, per overlapping time bin. The quantities ti = 
max(t;i,tji) and t u — min(ti Pi , t j P . ) define the lower and 
upper time bounds, respectively, of the overlapping regions 
of the two light curves. If two normalized light curves have 
no overlap, then their distance is set to a large reference 
value. Finally, we define the total distance between two light 
curves, as s(Xj,Xj) = £\ Sj,(Xj, Xj), the sum of the dis- 
tances in eq. (0, across bands. This distance is used in 
eq. (JTJ to build the weighted graph which we use to com- 
pute the diffusion map parametrization, {^i, ^jv}, of 
our data (equation [2} ■ Each VE*; are the coordinates of the 
i th SN in the estimated diffusion space. 



4 SEMI-SUPERVISED SUPERNOVA 

CLASSIFICATION: THE SUPERVISED PART 

Once we have a parametrization, <!', for each supernova, the 
next step is to use a training set of SNe of known, spec- 
troscopically confirmed type to learn a classification model 
to predict the type of each supernova from its representa- 
tion \E'. This is the supervised part of our semi-supervised 
methodology. 



3.3 Light Curve Distance Metric 

Call the set of normalized light curves for supernova i 
{tik, Fffc, of fe }. For each pair of supernovae, Xi and Xj, we 



4.1 Constructing a Training Set 

In the Supernova Challenge, the training set, S, was gener- 
ated assuming a fixed amount of spectroscopic follow-up on 
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Table 3. Spectroscopic integration times assumed for follow-up 
observations of supcrnovae, based on average conditions at the 
VLT using the FORS instrument. 



r-band mag 


integration time (minutes) 


20 


1 


21 


2 


22 


5 


23 


20 


24 


100 


25 


600 


25.5 


1500 



each of a 4m and 8m-class telescope. The magnitude lim- 
its were assumed to be 21.5 ( r-band) for the 4m and 23.5 
(i-band) for the 8m telescope |Kessler et al.|[2010al ). 

Using 5 as a training set is problematic for at least 
two reasons. First, S consists of SN that have much lower 
host-galaxy z an d higher observed br ightness than those in 
V (see Fig. 2 in iKessler et alj|2010af ). Second, S oversam- 
ples the number of Type la SN relative to that in the entire 
data set, T> (see Table [SJ. These distributional mismatches 
induce inadequate modeling in those parameter subspaces 
undersampled by S and can cause model selection proce- 
dures to choose poor models for classification in V. Both of 
these issues hinder attempts to generalize models fit on S to 
classify supernovae in V. 

We study the dependence of the classification accuracy 
on the particular training set employed (or more precisely, on 
the specific procedure used to perform spectroscopic follow- 
up). In this section, we propose a variety of procedures to 
procure spectroscopically confirmed labels; in ij5] we will 
analyse the sensitivity of our classification results to the 
training set used and determine the optimal strategy for 
spectroscopic SN follow-up. 

We phrase the problem in the following way: assum- 
ing that we have a fixed number of hours for spectroscopic 
follow-up, what is the optimal way to use that time? To 
simplify our study, we assume that all follow-up is per- 
formed on an 8m telescope with spectroscopic integration 
times given in Table [3l based on simplistic simulations un- 
der average conditions using the FOcal Reducer and low 
dispersion Spectrograph for the Very Large Telescope (VLT 
FORS) exposure time calculator 0. The amount of spectro- 
scopic follow-up time necessary for each target is determined 
by the integration time (Table[3]) corresponding to its r-band 
maximumlj These integration times are meant to be approx- 
imate figures for means of constructing simulated spectro- 
scopic training sets. 

Under this scenario, the spectroscopic data in S require 
a total of 24,000 minutes (400 hours) of integration time. 
Assuming this amount of follow-up timfl we construct al- 
ternate training sets using each of the following procedures: 



(i) Observe SNe in order of decreasing brightness. This 
strategy observes only the brightest objects, and allows us 
to obtain spectra for the maximal number of supernovae. 
Call this Sb- 

(ii) Perform a (r-band) magnitude-limited survey, down 
to a prespecified magnitude cutoff. Here, we randomly sam- 
ple objects brighter than the magnitude cut, until the ob- 
serving time is filled. We experiment with four different 
cutoffs: 23.5, 24, 24.5, and 25th magnitude. Call these 

O m ,23.5, Sm,24, O mj 24.5, and S m ,25- 

(iii) Perform a redshift-limited survey. We try two differ- 
ent redshift cutoffs: 2=0.4, 0.6. Call these 5 z ,o.4, and S z ,o.e- 

The magnitude- and redshift-limited surveys both have a 
random component. To quantify the effects that this ran- 
domness has on the samples and the ultimate supernova 
typing, we construct 15 data sets from each spectroscopic 
"survey". In Table [4] we display the median number of SNe 
from each of S and V contained in each spectroscopic train- 
ing set. Note that as the limits of the survey get fainter 
(and higher z), the ratio of elements from V to S increases, 
as the total number of SNe decreases. Table[5]shows the me- 
dian class composition of each training set. Here, the deeper 
training sets more closely resemble the class distribution in 
T>. We will return to these training sets in |j5j 



4.2 Tuning the Classifier 

We build a random forest classifier, h, by training on the 
diffusion map representation and known type of a set of 
spectroscopically confirmed SNe. This classifier then allows 
us to predict the class of each newly observed SN light curve. 
In constructing such a classifier, there are three tuning pa- 
rameters which must be chosen: 

(i) e, the diffusion map bandwidth in eq. {TJ, 

(ii) m, the number of diffusion map coordinates used in 
the classifier, and 

(iii) 71a, the minimum proportion of random forest trees 
predicting that a SN is of Type la necessary for us to decide 
that the SN is Type la. 

In this section we describe how to choose these parameters 
in a statistically rigorous way. 

The SN Classification Challenge was based on correctly 
predicting the Type la supernovae. The Challenge used the 
Type la Figure of Merit (FoM) 



/la 



« U6 ) 2 



N true + WN U 



(8) 



5 http : / /www. eso . org/observing/etc/bin/gen/f orm?INS . NAME= 

e For non-integer magnitudes, the integration times are deter- 
mined by a quadratic interpolation function. 

7 Note that we fix the total amount of integration time needed, 
ignoring the time lost to overheads, weather, and other effects. 



where iVj^ a is the total number of Type la SNe in the sam- 
ple, Nil ue is the number of Type la SNe correctly predicted, 
and Njf BC is the number of non-la SNe incorrectly predicted 
to be Type la. The factor W controls the relative penalty on 
false positives over false negatives. For the SN Photometric 
Classification Challenge, W = 3. This penalty on Type la 
purity means that we need to be conservative in calling a 
SN a Type la: we are penalized three times more by calling 
a non-Ta a Tvne Ta than in calling a Type la a non-la. 

fcmcc rhc Challenge gives an explicit criterion (equa- 

tion[S|, we choose the tuning parameters that maximize /i a . 
To avoid overfitting to the training set, we maximize a 10- 
fold cross-validation estimate of /i a and call this maximum 
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Table 4. Composition of the training datascts, broken down by the SN Challenge spectroscopic/photometric designation. In the first 
two rows, each cell's entry shows the number of elements in the set that is the intersection of S or V with the respective training set. In 
the third row, the total number of objects in each training set is given. 



Set 


S 


S B 


<S m , 23. 5 


5m, 24 


5m, 24. 5 


5m, 25 


S z ,0A 


5 2 ,0.6 


V 


S 


1103 


686 


294 


73 


26 


11 


44 


15 


1103 


V 





1765 


979 


508 


272 


155 


240 


117 


19792 


Total 


1103 


2451 


1273 


587 


302 


165 


284 


135 


20895 



Median values are displayed over 15 training sets. Only for S and Sb are the training set identical on each iteration. 
Table 5. Composition of the training datasets, broken down by the SN Challenge spectroscopic/photometric designation. 



SN Type 


S 


S B 


Sm, 23. 5 


<Sm,24 


Sm, 24.5 


Sm,25 


5 Z ,0.4 


5 Z ,0.6 


V 


la 


559 


1313 


557 


178 


75 


39 


45 


22 


5088 


Ib/c 


15 


18 


13 


5 


1 


2 


7 


1 


259 


lb 


71 


168 


79 


32 


15 


8 


30 


10 


1438 


Ic 


58 


88 


50 


26 


13 


8 


36 


14 


1104 


Iln 


63 


131 


84 


42 


24 


14 


10 


5 


1939 


II-P 


326 


707 


480 


295 


161 


90 


159 


74 


10642 


II-L 


11 


26 


12 


5 


3 


2 


17 


6 


425 



Median values are displayed over 15 training sets. Only for S and Sb are the training set identical on each iteration. 



value /j* a . We find that / Ia is insensitive to the value of m 
for a large enough m, as the random forest largely ignores 
irrelevant components, so for the rest of the analysis we fix 
m — 120. To find the optimal model, (e*,ij a ), we perform a 
two dimensional grid search over (e,7i a ). Once this optimal 
model is discovered by minimizing the cross- validated /i a , it 
is applied to the photometric sample to predict the class of 
each supernova. 

4.3 Incorporating Redshift Information 

In addition to observed light curves, host-galaxy (photomet- 
ric) redshift estimates, z, are often available. If this informa- 
tion is known, it should be included in the supernova anal- 
ysis. We may incorporate redshift information in one of two 
ways: 

• directly in the calculation of the pairwise distance met- 
ric, s; or 

• as an additional covariate for the classification model, 

h. 

In the former approach, we artificially inflate the distance 
measure between two SNe, i and j, if 



where u denotes the estimated redshift uncertainty, and n a 
is a set constant (e.g., 3). Using eq. @, we effectively deem 
two supernovae 'dissimilar' if their redshifts are greater than 
n a standard deviations apart, even if their light curve dis- 
tance, s, is small. This approach can aid to break redshift 
degeneracies in light curve shapes and colours. 

In the latter approach, we simply use z as a covariate in 
the classification model in addition to the diffusion map co- 
ordinates. This treatment of redshift allows the classification 
model to directly learn the dependencies of supernova type 



on redshift. However, this method relies on the assumption 
that the training SN distribution resembles that of the set 
to be classified (and, crucially, that the entire range of red- 
shifts is captured by the training data). The first approach 
does not rely on such assumptions because the redshifts are 
employed, in an unsupervised manner, to alter the SN rep- 
resentation prior to learning a classification model. 

In Tables [S] and O we show the classification results, 
on the SN Challenge data, of incorporating redshift into our 
analysis using each of the two methods. The strategy of using 
redshifts to directly influence the diffusion map coordinates 
performs better than including the redshifts as a covariate 
in the classification model, and both methods of incorpo- 
rating redshift information provide improved classifications 
(see 3541) . 



4.4 Computational Resources 

Though fitting the above described method can take sub- 
stantial computational resources, the splines, spectral de- 
composition, and random forest have very stable and effi- 
cient implementations that are readily used and highly op- 
timized. The dominating computation remains computing 
the distance matrix, with the bottleneck being the cross- 
correlation approach to estimating the time of maximum 
flux of a non-peaked SN light curve. 



5 RESULTS 

In this section, we analyse 20,895 supernovae provided in 
the Supernova Photometric Classification Challenge using 
the methods introduced in ^ £12141 
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Figure 5. Top: Diffusion coordinates for all (spectro- 
scopic+photometric) SNe, of Type la (red), Iln+II-P+II-L 
(green), and Ibc+Ib+Ic (blue). Bottom: Diffusion coordinates for 
only the spectroscopic SN sample. Left: In the first two coordi- 
nates, there is clear separation between the Type II and Type I 
SNe. Right: Plotted are the two most important diffusion coor- 
dinates for SN classification, as determined by a random forest 
classifier built on the training set. Clear separations between SN 
types in the spectroscopic set are not present in the combined set. 



5.1 Diffusion Map Representation of SNe 

The first step in this analysis is to compute the diffusion map 
representation of each SN. In Figure[5]we plot diffusion map 
coordinates, coloured by true class, for the entire SN set, T> 
(top) and spectroscopic set, S (bottom). These coordinates 
were computed without using host photometric redshifts. 
Large discrepancies between the diffusion map distributions 
of these two sets are obvious. The left panels of Figure[5]show 
the first two diffusion coordinates, where clear separation 
between the Type I and Type II SNe is obvious. The right 
panels plot the two most important coordinates (3 & 7), as 
estimated by a random forest classifier trained only on the 
spectroscopic data. With respect to these two coordinates, 
there is separation between the Type la and Ibc supernovae 
in the spectroscopic set. 

In Fig. [6] we plot the average 4-band light curves of the 
SNe within each of 4 regions in diffusion space. Stepping 
across this coordinate system (with respect to coordinates 3 
and 7) , we see that the relative strength of the gr-band emis- 
sion increases and the light curves become more plateaued. 
This corresponds to a gradual transition from Type la to 
Type II SNe. 

We also explore the behavior, in diffusion space, of SN 
redshift. Fig.[7|plots the first two diffusion coordinates of the 
5088 la SNe, coloured by their true redshift. Even though 
we did not use any redshift information in the computation 
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Figure 6. Top: Supernovae in the spectroscopic sample S show 
a separation between the various supernova types in diffusion co- 
ordinates 3 and 7. Bottom: The average griz light curves in each 
of the four boxes B1-B4 are plotted, revealing a gradual flatten- 
ing of the light curves and an incremental increase in the relative 
strength of the (/-band emission. 



of these coordinates, we see a gradual trend in redshift with 
respect to this coordinate system, with an increase in z as 
one moves diagonally from bottom right to top left. This 
means that our construction of distance measure for the dif- 
fusion map captures the incremental changes with respect 
to redshift that occur in the light curves. Note that using 
the entire set of data to estimate the diffusion map coor- 
dinates has allowed us a fine enough sampling with respect 
to redshift; this would generally not be the case if we were 
to employ only the spectroscopic SNe to build the diffusion 
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Figure 7. Redshift of all 5088 Type la supcrnovae, plotted in 
the first two diffusion map coordinates. The true redshift varies 
gradually across diffusion space, even though this information was 
not used to construct the diffusion map. 

map. This is a critical result because it shows that the main 
source of variability in diffusion space is due to SN type, and 
not redshift. Hence, we can use the diffusion map represen- 
tation of the SNe directly, without having to estimate the 
redshift of each SN and correct for its effects on the light 
curves. In EI5.4I we will compare this result with using host- 
galaxy photo-z estimates to directly alter the diffusion map 
coordinates. 

5.2 Classification of Type la SNe 

Given the diffusion map representation of the entire set of 
SNe, T> (Fig. [5j top) and a training set (see H4.1[l , we train 
a classification model to predict the types of the remaining 
supernovae. To achieve this, we tune the classifier on our 
training set, as in £|4.2I and then apply this classifier to V . 

In Table [6] we display the results of Type la SN classifi- 
cation with the eight training datasets described in H4.ll For 
each training set, we report the optimal training set Type-la 
FoM, / : * a , and tuning parameters, (e*,t* a ), and the Type-la 
FoM, /ia.prod, of the predictions for P0 Additionally, we 
show the purity and efficiency of Type-la classification, de- 
fined as 

P la = Arti-iic T /vfalse 
JV Ia " t_iV Ia 

and 

jytruc 

= ^pal- (11) 



For each training set, we evaluate /i a ,pred f° r on ly those points 
in V not included in the training set. 



Note that all entries in Table [6] are median values over 15 
repetitions of each training set(j 

Results of this experiment show that the deeper 
magnitude-limited follow-up strategies perform the best, 
achieving a /i a , P rcd of 0.305, a value 2.4 times the FoM 
achieved by the classifier trained on S. For the iS m ,25 training 
procedure, (pi a , P red, ei a>P red) = (0.72,0.65). For each train- 
ing set, the cross- validated Type la figure of merit estimated 
on the training data is significantly larger than the figure 
of merit achieved on the photometric data, showing that 
none of the training sets completely resembles V ■ Notably, 
on the Challenge training set, 5, our method achieves a 
cross- validated Type la purity /efficiency of 95%/87%, but 
this transfers to a purity /efficiency of 50%/50% on the data 
in V. 

Figure [5] displays the distribution of Type la FoM, pu- 
rity, and efficiency for each of the spectroscopic follow-up 
procedures. A few observations: 

• The deeper magnitude-limited surveys perform the best 
in terms of FoM. The obvious trend is that deeper surveys 
perform better, even though their training sets are much 
smaller. For instance, the 23.5 mag. limited survey contains, 
on average, 7.7 times the number of training SNe as the 25 
magnitude limited survey, but attains a prediction FoM a 
mere 19% as large. 

• Compared to the Challenge training set, S, the 25th 
magnitude-limited survey attains a 44% increase in purity 
and 30% increase in efficiency of Type la SNe. 

• The worst strategy is to follow-up on only the brightest 
SNe. Though this maximizes the number of labelled super- 
novae, it produces the smallest figure of merit. 

• Redshift limited surveys are suboptimal to magnitude- 
limited surveys. Though this strategy results in Type la sam- 
ples with high purity, the efficiency of these samples is very 
low. A redshift-limited survey is not ideal for our approach 
because we have not directly modeled the redshift depen- 
dence of SN light curves. Without the ability to capture high 
z SNe, our model becomes overly conservative, resulting in 
low efficiency. 

• Though there can be large variability in the actual sam- 
ples produced by the magnitude-limited surveys, the Type 
la FoM is very stable, with FoM interquartile range typically 
smaller than 0.05. 

Based on this study, we recommend that deeper magnitude- 
limited follow-up strategies be used to attain SN training 
samples. Using a 25th magnitude-limited follow-up proce- 
dure yields a la FoM of 242% that of the shallower procedure 
used in the SN Challenge. 

5.3 Type II-P Classification 

Type II-P supernovae are also useful for cosmological 
studies because they can be used a s standard candles 
|Poznanski. Nugent fc Filippenkol |2010T ). Here, we classify 
Type II-P supernovae in the SN Challenge data set using 
the above methods and analogous Type II-P figure of merit. 

9 Training sets S and Sb are the same for each iteration, but 
results differ slightly on each iteration due to randomness in the 
random forest classifier. 
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Table 6. Results of Classifying Type la Supernovac using training 
sets from i|4,ll 



Training Set 


e* 


Ia 


fL 


</"la,pred 


Pla,pred 


^-Ia,pred 


S 


1.4 


0.58 


0.757 


0.126 


0.503 


0.497 


S B 


1.4 


0.65 


0.840 


0.011 


0.240 


0.125 


Sm, 23. 5 


1.4 


0.54 


0.796 


0.058 


0.404 


0.316 


5m, 24 


1.0 


0.46 


0.728 


0.155 


0.605 


0.459 


Sm, 24. 5 


1.0 


0.45 


0.610 


0.250 


0.730 


0.501 


5m, 25 


1.0 


0.37 


0.494 


0.305 


0.724 


0.654 


Sz, 0.4 


1.4 


0.41 


0.664 


0.061 


0.896 


0.085 


S z ,0.6 


1.2 


0.29 


0.600 


0.112 


0.772 


0.249 



fj is computed on training set via 10-fold cross-validation. 

/ia.predi Pla.prcd: an d e la,pred are evaluated on all data in the 
photometric set V not in the training set. 



Table 7. Results of Classifying Type II-P Supernovac using train- 
ing sets from £|4,ll 



Training Set 


€* 


*IIP 


•'iip 


ZllP.pred 


PllP.pred 


e IIP,pred 


S 


1.6 


0.55 


0.834 


0.218 


0.866 


0.319 


S B 


1.6 


0.49 


0.835 


0.203 


0.820 


0.337 


<Sm, 23. 5 


1.6 


0.54 


0.826 


0.286 


0.865 


0.430 


Sm,24 


1.2 


0.59 


0.791 


0.491 


0.896 


0.648 


S m ,24.5 


1.0 


0.52 


0.747 


0.586 


0.868 


0.835 


<S m ,25 


1.0 


0.48 


0.593 


0.532 


0.845 


0.862 


Sz,0A 


1.4 


0.57 


0.745 


0.289 


0.844 


0.456 


Sz, 0.6 


1.2 


0.55 


0.660 


0.383 


0.838 


0.673 



/j* Ip is computed on training set via 10-fold cross-validation. 

/ilP.pred) PllP,pred> an d EllP.pred are evaluated on all data in the 
photometric set T not in the training set. 



In Table[7]we display the results of Type II-P supernova 
classification with each of the 8 training sets, and in Fig. 
[9] we plot the distribution of the Type II-P FoM, purity 
and efficiency with respect to each spectroscopic follow-up 
strategy. Much like the Ia study, we find that the deeper 
magnitude-limited surveys perform the best. We find that 
in terms of Type II-P figure of merit, a 24.5th magnitude 
limited survey performs the best, achieving /iip, pr cd = 0.586, 
with purity and efficiency at 87% and 84%, respectively. 

Qualitatively, the Type II-P figures of merit resemble 
those of the type Ia study. For each training set, the purity 
of the classifications is quite high, above 80%, while the ef- 
ficiency differs greatly, from a minimum of 32% for 5 to a 
maximum of 86% for iS m ,25- Compared to the training set, 
S, used in the SN Challenge, a 24.5th magnitude-limited 
survey achieves only a slightly better II-P purity, but a 1.6 
times increase in II-P efficiency. 

5.4 Incorporating Host Redshift 

Finally, we study the performance of the two methods of 
incorporating host-galaxy redshifts ( 34.30 . In Tables [8] and [9] 
we show the results of classifying Type Ia and II-P SNe, re- 
spectively, using each of the two strategies for incorporating 
redshifts. Results are shown for each of the 8 training sets, 
where the optimal redshift cutoff, n 3 in eq. (J5J), was chosen 
by maximizing the cross-validated training set FoM over a 
grid of integer values from 2 to 6. 



Type la Classification on Set P 



CO 

ci 




■ Purity 

^ - ■ Efficiency 

Figure 8. Performance on Type Ia SNe in V of classifiers trained 
on spectroscopically confirmed supcrnovae from 8 different follow- 
up procedures. Top: Type Ia Figure of Merit, /i a , pr ed- Bottom: 
Type Ia purity and efficiency. Boxplots show the distribution of 
each metric over 15 training sets. The obvious winner is the deeper 
magnitude-limited survey, which achieves median purity and effi- 
ciency between 65-75%. 



There is a clear improvement to the FoMs by including 
host-galaxy redshifts. Compared to the non-redshift results 
in Tables 071 the FoM values increase for every training set 
by the use of redshift to alter the diffusion map coordinates. 
Using redshifts to alter the diffusion map coordinates con- 
sistently performs better than using redshift as a covariatc. 
Overall, the best strategy for including host-galaxy redshifts 
for Type Ia classification is to use eq. © with n 3 — 2 on 
the training set S m ,25- Using this prescription yields a Type 
Ia FoM of 0.355, an improvement of 16% over the best FoM 
without redshift. For Type II-P classification, the best strat- 
egy is to use n s = 4, resulting in a FoM of 0.612, which 
represents an improvement of 4.4%. 

We plot the Type Ia FoM, purity, and efficiency as a 
function of redshift in Figure [10] for the analysis without 
using photometric redshifts, and in Figure [11] for the anal- 
ysis incorporating photo-z's with n 3 — 2. Within each of 
7 equally sized redshift bins, the median performance mea- 
sures are plotted. The magnitude-limited training set ex- 
periences improvements in performance for both the low- 
est and highest redshifts. The purity of the Type Ia sample 
found using the magnitude- limited training sample improves 
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Figure 9. Same as Figurc[8]for Type II-P classification. Here, a 
24.5th magnitude-limited survey attains the maximal II-P figure 
of merit. Each spectroscopic training strategy results in a high 
II-P purity, but a much different classification efficiency. 



significantly at the lowest and highest redshifts by incorpo- 
rating redshifts. This highlights the tremendous value that 
host redshifts have in Type la classification, especially at 
breaking degeneracies between supernova type and redshift. 



6 SUMMARY AND CONCLUSIONS 

In this paper, we introduce the first use of semi-supervised 
classification for supernova typing. Most of the previous 
methods have relied on template fitting. Only recently, due 
in large part to the Supernova Classification Challenge, 
other statistics and machine learning methods been used 
for SN typing. Our semi-supervised approach makes efficient 
use of the data by using all photometrically observed super- 
novae to find an appropriate representation for their classifi- 
cation. Also, we show that the complex variation in SN light 
curves as a function of redshift is captured by this represen- 
tation In this manner, classification accuracy will improve 
as both the number of observed SNe grows and the param- 
eters such as redshift, stretch, and reddening are sampled 
more densely. It is not clear how this adaptation can occur 
for existing methods, where either a fixed set of templates is 
used, or sets of summary statistics are extracted from each 
light curve independently. 

Another advantage of our approach is the flexibility in 
the choice of distance measure, s, in the diffusion map con- 
struction. In our analysis, we used only the shapes of the 
light curves and colours of the SN to define s (eq. [7]) and 
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Figure 10. Performance of Type la classification as a function of 
redshift, for classification without using host-galaxy photometric 
redshifts. In 7 equal sized redshift bins, the median FoM (top), 
purity (bottom left), and efficiency (bottom right) are plotted for 
three different training sets. All three training sets result in poor 
estimates for high redshift SNe. The magnitude-limited training 
set 5 m ,25 (solid line) performs poorly for low redshifts, but the 
best for redshifts >~ 0.5. The Challenge training set (S, dashed 
line) performs poorly for all redshifts, while the redshift-limitcd 
training set <S Zj o.6 (dotted line) performs the best for low redshifts 
but declines sharply to zero after reaching its redshift limit. 



Table 8. Results of Classifying Type la Supernovac incorporating 
host redshifts. 



Tr. Set 


K 


e* 


t* 

In 




/la,pred 


Pla,prcd 


^Ia,prcd 


S 


2 


1.1 


0.53 


0.871 


0.249 


0.539 


0.9 






1.2 


0.59 


0.84 


0.131 


0.446 


0.584 


S D 


6 


1.2 


0.55 


0.903 


0.029 


0.24 


0.308 






1.6 


0.58 


0.899 


0.014 


0.18 


0.213 


S m , 23. 5 


6 


1.4 


0.53 


0.873 


0.108 


0.463 


0.482 






1.4 


0.57 


0.848 


0.06 


0.369 


0.37 


S m ,24 


1 


1 


0.47 


0.799 


0.252 


0.726 


0.564 






1.4 


0.51 


0.733 


0.153 


0.633 


0.463 


S m , 24. 5 


5 


1.2 


0.44 


0.689 


0.315 


0.769 


0.594 






1 


0.43 


0.649 


0.232 


0.716 


0.503 


<Sm, 25 


2 


1.2 


0.39 


0.615 


0.355 


0.758 


0.741 






1 


0.39 


0.54 


0.308 


0.732 


0.688 


Sz,0.4 


4 


1.1 


0.45 


0.649 


0.065 


0.923 


0.078 






1.6 


0.41 


0.671 


0.058 


0.87 


0.084 


S z ,0.6 


1 


1.4 


0.39 


0.562 


0.104 


0.831 


0.206 






1.6 


0.32 


0.591 


0.116 


0.761 


0.257 



n s = - indicates that host-galaxy redshift is used as a covariatc 
in the Random Forest classifier, and not to construct diffusion 
map. 

/j* is computed on training set via 10-fold cross-validation. 

/ia.pred, Pla.prcdi and ei a pre( j are evaluated on all data in the 
photometric set T not in the training set. 
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Figure 11. Same as Figure llOl using photometric redshifts to 
alter the diffusion map coordinates, with n s = 2. The magnitude- 
limited training set <S ?7li 25 (solid line) performs similarly across 
all redshifts except the highest redshift bin, besting the figure of 
merit for the Challenge training set (5, dashed line) for six of the 
seven redshift bins. Performance for the redshift-limited training 
set <S z ,o.6 (dotted line) declines sharply after reaching the redshift 
limit. 



Table 9. Results of Classifying Type II-P Supernovae incorpo- 
rating host redshifts. 



Tr. Set 


< 


e* 


Ia 




/llP.prcd 


PllP.pred 


e IIP,pred 


S 


6 


1.6 


0.55 


0.829 


0.221 


0.944 


0.261 






1.6 


0.55 


0.862 


0.219 


0.919 


0.275 


S B 


6 


1.4 


0.57 


0.795 


0.189 


0.946 


0.221 






1.8 


0.52 


0.85 


0.174 


0.889 


0.235 


S m ,23.5 


1 


1.1 


0.56 


0.821 


0.272 


0.913 


0.355 






1.8 


0.55 


0.849 


0.294 


0.882 


0.408 


S m , 24 


6 


1.6 


0.57 


0.818 


0.584 


0.923 


0.735 






1.6 


0.57 


0.787 


0.488 


0.891 


0.674 


Sm, 24.5 


1 


1 


0.55 


0.755 


0.612 


0.904 


0.807 






1.4 


0.54 


0.727 


0.569 


0.881 


0.789 


5m, 25 


4 


1.1 


0.51 


0.694 


0.582 


0.889 


0.835 






1.2 


0.51 


0.6 


0.543 


0.856 


0.845 


S z fiA 


6 


1.4 


0.59 


0.715 


0.245 


0.724 


0.552 






1.6 


0.57 


0.754 


0.315 


0.836 


0.493 


S z ,o.& 


1 


1 


0.55 


0.671 


0.298 


0.779 


0.606 






1.4 


0.57 


0.658 


0.375 


0.846 


0.67 



n s = - indicates that host-galaxy redshift is used as a covariatc 
in the Random Forest classifier, and not to construct diffusion 
map. 

/j*j P is computed on training set via 10-fold cross-validation. 

/np.prodi PllP,pred> an d e llP,pred are evaluated on all data in the 
photometric set T not in the training set. 



also showed how this distance can be modified if host-galaxy 
redshifts are available (eq. [9j . When using a diffusion map 
for SN classification, each astronomer is free to use their 
own choice of s, and presumably more sophisticated distance 
measures, such as ones that include more context informa- 
tion or more accurately capture the inter-class differences in 
SN light curves, will perform better. 

In applying our semi-supervised classification approach 
to data from the SN Classification Challenge, we find that 
results are highly sensitive to the training set used. We pro- 
posed a few spectroscopic follow-up strategies, discovering 
that deeper magnitude-limited surveys obtain the best clas- 
sification results-both for Type Ia and II-P SNe-despite ac- 
cruing labels for a smaller number of supernovae. Results 
show that our methods are competitive with the entrants 
of the SN Challenge as we obtain Type Ia purity/efficiency 
of 72%/65% on the photometric sample without using host 
redshifts, and 76%/74% using host redshifts. We hesitate 
to compare directly with results from that challenge due to 
large differences in the corrected set of data studied in this 
paper. 

Throughout this study, we attempted to avoid all use of 
SED templates. However, there is some physical knowledge 
that is impervious to SED template problems, such as cos- 
mological time dilation. Indeed, our method of incorporat- 
ing host-z using equation (j9} does account for time dilation: 
even if a high- z type Ia SN light curve appears like a low-2 
type II light curve, the degeneracy is broken because they are 
pulled apart in diffusion space since their redshifts are dif- 
ferent. Thus, our conclusion that deeper magnitude-limited 
surveys produce better training sets is not simply an artifact 
of neglecting to explicitly model time dilation: deeper train- 
ing sets still (tremendously) outperform shallower training 
sets even after incorporating host redshift (see Tables [HH9]) . 
The improved performance is likely due both to training 
on lower S/N data and capturing SED-dependent effects at 
high z. 

Though our template-free method allows us to avert 
classification errors that arise through use of wrong or in- 
complete template bases, as a trade-off we cannot extend 
results to redshifts outside of our training data, and thus 
require deeper training sets. This inability of our method 
to extrapolate beyond the training set is exhibited by the 
poor performance at high redshifts for the redshift-limited 
training sample (see Figures fTUI and [TT ]) . This suggests that 
our methods can be improved with increased use of physi- 
cal modeling; indeed the future may lie in a combination of 
both semi-supervised learning and template methods. 

In a future work, we plan to research such hybrid mod- 
els, which combine the semi-supervised approach presented 
in this paper with physical SN models, in hopes to decrease 
the dependency of our results on the specific training set 
employed and to improve the overall classification accuracy. 
One possible approach to this is to include, along with all of 
the observed SNe, sets of supernovae simulated from differ- 
ent templates with different values of the relevant parame- 
ters. This method would allow both the observed supernova 
light curves, along with the templates, to discover the opti- 
mal diffusion map representation of the SNe. Then, in the 
supervised part of the algorithm, the classification model 
would be able to learn from both the spectroscopic data 
and the simulated supernovae, allowing the model to extend 
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more widely (e.g., to higher redshifts than those sampled by 
the training set). 
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APPENDIX A: CLASSIFICATION TREES 

We start with a brief description of classification trees 
(|Breiman|[l984T) . Start with the n supernovae with known 
class labels. These data are pairs (tyi.yi), .... (*& n ,y n ), 
where each ^f?i is the m-dimensional diffusion map repre- 
sentation from equation ([2]), and each classification yi is 
coded to take on a value in {1, K }. The classification 
tree consists of a series of recursive binary splits of the Tri- 
dimensional diffusion map space. Each split is performed on 
one coordinate, resulting in two nodes (two regions of in- 
put space). The predicted class label k £ {1, K}, at each 
terminal node, Rj is 

hj = argmax — S ' I(yt = k) (Al) 

where rij is the number of training points in Rj and /(•) 
is the indicator function. Thus, the predicted class label at 
each node in the classification tree is the supernova type 
with the largest proportion of data points in Rj. The tree 
classifier for any input \E' is 

fc to »(*,e) = 5^^J(*6J2 i ) ) (A2) 

3 

i.e., the prediction for Nl" is the predicted class for its terminal 
node, defined by equation (|A1I) . In equation (|A2I) . char- 
acterizes a tree in terms of the variable split and cutpoint at 
each node, and the predicted terminal node classifications. 

To build the classification tree, at each successive node 
we choose the variable and splitting point that produces the 
largest decrease in misclassification rate, defined as 

n 

L(&) = - V I( Vi ± h trco (*0) (A3) 
n — ' 

!=1 

Other versions of classification trees use Gini index (used by 
CART) or entropy (used by C4.5) to determine the splitting 
points. The splitting process is repeated until minj rij ^ 
n m in, giving us the estimated model (|A2fl . 

Tree models are simple, yet powerful, nonparametric 
classifiers. They work well even when the true function h 
that relates the input space to the class labeling is com- 
plicated, generally yielding estimates with very small bias. 
Furthermore, there are fast, reliable algorithms for fitting 
trees and prediction is very simple. However, tree models 



tend to have high variance. Small changes in the SN train- 
ing set, can produce very different estimated tree 
structure. This is a result of the hierarchical nature of the 
tree model: small differences in the top few nodes of a tree 
can produce wildly different structure as those perturbations 
are propagated down the tree. To reduce this instability, we 
use the random forest approach, which we now describe. 



APPENDIX B: RANDOM FORESTS 

Random forest is a committee method that builds a large 
collection, B, of decorrelated classification trees, each one 
fit on a subset of the data, and then averages them to ob- 
tain a final predictive model, h T { with reduced variance over 
any single classification tree. Random fore st is an improve - 
ment to bagging (bootstrap aggregation, iBreimanl Il996l ). 
a method that averages the predictions of B trees fit to 
bootstrapped samples of the data, because it attempts to 
decorrelate the B trees (without increasing the variance 
too much) by selecting a random set r < m of the input 
coordinates as candidates for splitting at each node dur- 
ing the tree building process. The net result of building 
decorrelated classification trees is that the final, averaged 
mod el will have smaller variance than t he bagging model 
(see lHastie. Tibshirani fc Friedman! 12009] . ch. 15 for a dis- 
cussion). The observations and coordinates that are not in- 
cluded in a given tree are referred to as "out of bag." 

After growing B decorrelated classification trees, 
^troc(4 r , ©i), /itree(^ r , ©s), we let them vote on the class 
label to determine the classification of supernova \E r . The 
predicted classification for \E' is the majority vote of these 
trees, 

hti(&) = majority vote j/w(*, f>)} fc ( B1 ) 

where ties are broken at random. Alternatively, we can keep 
track of the number of 'votes' for each class to estimate a 
measure of classification probability of each class for a super- 
nova. To fit random forest models to the diffusion map rep- 
resentations of SNe, we use the randomForest R packagg 10 !. 
An advantage to random forest is the relative robustness of 
the estimates to choices of the tuning parameters compared 
to other nonparametric classification techniques. In practice 
we build B = 500 trees, select r = \/m coordinates as split- 
ting candidates at each node, set the minimum node size 
to n m i n = 1 and take bootstrap samples of size n (where n 
is the number of observed supernovae). The dimensionality, 
m, of the diffusion map space and the diffusion map tun- 
ing parameter e can both affect classification results, so we 
optimize our training set misclassification error over (e, m). 
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